8 - Linear Regression

In this lecture, we cover the most basic (but still quite powerful) of all regression tasks: linear regression. We formally define the problem, discuss its empirical version as a maximum likelihood estimator, and bound its sample complexity. We then discuss some generalizations.

8.1 Linear Model of the Data

In the basic setting of linear regression, we assume that there is some "true" linear model of the data, meaning that there is a target weight vector $\mathbf{w}_* \in \mathbb{R}^{d}$ such that for data vectors $\mathbf{x} \in \mathbb{R}^d$ drawn from some unknown distribution, we have

$$ y_* = \mathbf{w}_*^\top \mathbf{x}. $$

Here we assume that the intercept of the linear function is zero, for notational convenience. This is without loss of generality, by the same argument we used when we analyzed the Perceptron algorithm in the last lecture: if there were some nonzero intercept $\theta,$ we could simply enlarge the dimension of the weight vector from $d$ to $d+1$ and "pad" each data vector by adding '1' as its last element. The $(d+1)$-th element of the "new" $\mathbf{w}_*$ would then encode the intercept.

As usual, we have access to labeled independent examples and our goal is to create a linear model by finding a weight vector $\mathbf{\hat{w}}$ that we can use to predict labels $\hat{y} = \mathbf{\hat{w}}^\top \mathbf{x}'$ on new unseen data vectors $\mathbf{x}'$ drawn from the same distribution as $\mathbf{x}.$

Suppose first that the distribution from which we were drawing $\mathbf{x}$ was "reasonable," in the sense that by drawing $n = C d$ samples for some constant $C \geq 1$, we get $n$ data vectors $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$ of which $d$ are linearly independent. This would be the case, for example, for data vectors drawn i.i.d. from a Gaussian distribution. Suppose that each vector $\mathbf{x}_i$ was accompanied by its true label $y_{*, i} = \mathbf{w}_*^\top \mathbf{x}_i$. Then we could exactly recover $\mathbf{w}_*$ just by solving the linear system

$$ \mathbf{w}_*^\top \mathbf{x}_i = y_{*, i}, \; \; i = 1, 2, \dots, n $$

and we see that $O(d)$ samples suffice in this case. This type of reasoning can be generalized even if we only get to see noisy labels, where the noise is, e.g., subgaussian (which we discuss next). However, to avoid complicating things too much, we will mainly focus on what is known as "fixed design," described in what follows.
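As a quick sanity check of this exact-recovery argument, here is a minimal sketch in Python, assuming Gaussian data vectors (so that $d$ i.i.d. samples are linearly independent with probability one); the dimension and random seed below are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
w_star = rng.standard_normal(d)          # hypothetical target weight vector
X = rng.standard_normal((d, d))          # d Gaussian data vectors (rows), full rank w.p. 1
y = X @ w_star                           # exact (noiseless) labels

w_recovered = np.linalg.solve(X, y)      # solve the d x d linear system
print(np.allclose(w_recovered, w_star))  # True: exact recovery from O(d) samples
```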

Let us start by now considering a more realistic case of linear regression, where the labels assigned to data vectors do not exactly correspond to a linear function, but can be modeled as

$$ y = \mathbf{w}_*^\top \mathbf{x} + \eta, $$

where we can think of $\eta$ as "noise" or "model misspecification." If we make no assumptions on $\eta$ and only seek to find the "best fit" model (in terms of the square loss; more on that soon), then we would have the agnostic learning setting. We will discuss this case at the end of the lecture.

An illustration of linear regression for $d=2$ is provided in the code below, where the scatter plot shows randomly generated data points with $y = \mathbf{w}_*^\top \mathbf{x} + \eta$ for a fixed linear function determined by $\mathbf{w}_*$ and random noise $\eta$. The plotted 2D plane shows the "best fit" linear function based on the generated data. One could then use this fitted linear function to predict the unknown label $y$ of any given data vector $\mathbf{x} = (x_1, x_2).$
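A minimal sketch of such an illustration could look as follows; the target weights, noise level, and sample size are hypothetical choices for visualization only.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 200
w_star = np.array([1.5, -0.7])                    # hypothetical target weights
X = rng.uniform(-1, 1, size=(n, 2))               # data vectors in R^2
y = X @ w_star + 0.3 * rng.standard_normal(n)     # noisy labels

# "Best fit" weights via ordinary least squares
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Scatter plot of the data and the fitted 2D plane
fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(X[:, 0], X[:, 1], y, s=10)
g1, g2 = np.meshgrid(np.linspace(-1, 1, 20), np.linspace(-1, 1, 20))
ax.plot_surface(g1, g2, w_hat[0] * g1 + w_hat[1] * g2, alpha=0.4)
ax.set_xlabel("$x_1$"); ax.set_ylabel("$x_2$"); ax.set_zlabel("$y$")
plt.show()
```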

8.2 Fixed Design, Ordinary Least Squares as MLE, and Sample Complexity

To get things started and gain some intuition, we begin the discussion assuming that

$$ \eta \sim \mathcal{N}(0, \sigma^2), $$

where $\eta$ is independent of $\mathbf{x}$ and any previous/future samples.

We consider now the "fixed design" setting, which means that we have access to $n$ data samples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n)$ where the data vectors $\mathbf{x}_1, \dots, \mathbf{x}_n$ are fixed (we can also view this as "conditioning on the data") and the only randomness in the labels comes from i.i.d. noise samples $\eta_1, \eta_2, \dots, \eta_n,$ such that

$$ y_i = \mathbf{w}_*^\top \mathbf{x}_i + \eta_i,\;\; i = 1,2, \dots, n, $$

where $\mathbf{w}_*$ is the unknown target weight vector.

8.2.1 Maximum Likelihood Estimation

In the described setup, clearly, each $y_i$ is an independent Gaussian random variable with mean $\mathbf{w}_*^\top \mathbf{x}_i$ and variance $\sigma^2.$ Thus, we can easily write the likelihood function for the unknown $\mathbf{w}_*$ as

$$ f(\mathbf{w}_*) = \prod_{i=1}^n \phi(y_i|\mathbf{w}_*, \mathbf{x}_i, \sigma^2) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi} \sigma}e^{-\frac{(y_i - \mathbf{w}_*^\top\mathbf{x}_i)^2}{2\sigma^2}}, $$

where in the equation above $\phi$ denotes the probability density function of $y_i$, given the values of $\mathbf{w}_*, \mathbf{x}_i, \sigma^2$.

Thus, the log-likelihood function is

$$ g(\mathbf{w}_*) = -n\log(\sqrt{2\pi}\sigma) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mathbf{w}_*^\top \mathbf{x}_i)^2 $$

and the maximum likelihood estimator of $\mathbf{w}_*$ is the maximizer of this function. Since nothing changes for the solution of an optimization problem if we add a constant to the objective or if we multiply it by a positive constant, finding the MLE of $\mathbf{w}_*$ can equivalently be expressed as

$$ \max_{\mathbf{w}\in \mathbb{R}^d} -\frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2 $$

or, in the minimization form we are more comfortable with as

$$ \min_{\mathbf{w}\in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2. $$

This problem is also known as the ordinary least squares (OLS) problem, while its objective is known as the mean squared error (MSE).

Notice that we can also use MLE to obtain an estimate of the noise variance $\sigma^2,$ by maximizing the log-likelihood over $\sigma^2.$ The resulting estimate $\hat{\sigma}^2$ in this case is the empirical estimate of the variance (do this as an exercise):

$$ \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{\hat{w}}^\top \mathbf{x}_i)^2. $$
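As a quick numerical check of these two maximum likelihood estimates (a sketch on hypothetical synthetic data), both $\mathbf{\hat{w}}$ and $\hat{\sigma}^2$ should approach $\mathbf{w}_*$ and $\sigma^2$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma = 10_000, 5, 0.7                     # hypothetical sizes and noise level
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star + sigma * rng.standard_normal(n)  # y_i = w_*^T x_i + eta_i

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)    # MLE of w_* (= OLS solution)
sigma2_hat = np.mean((y - X @ w_hat) ** 2)       # MLE of the noise variance

print(np.round(np.abs(w_hat - w_star).max(), 3)) # small for large n
print(round(sigma2_hat, 3), sigma**2)            # close to sigma^2 = 0.49
```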

8.2.2 Solving OLS

The OLS problem is often stated in a matrix-vector form, where we "stack" all data vectors as rows in an $n \times d$ matrix $\mathbf{X} = \begin{bmatrix}\mathbf{x}_1^\top\\ \mathbf{x}_2^\top\\ \vdots\\ \mathbf{x}_n^\top\end{bmatrix}$ and we stack all the labels in a vector $\mathbf{y} = \begin{bmatrix}y_1\\ y_2\\ \vdots\\ y_n\end{bmatrix}$ and write it as

$$ \min_{\mathbf{w}\in \mathbb{R}^d} \frac{1}{n}\|\mathbf{X}\mathbf{w} - \mathbf{y}\|_2^2. $$

We have seen problems like this one before. It is an unconstrained minimization problem whose objective is convex (why?), thus we know the solution $\mathbf{\hat{w}}$ to this problem corresponds to the zero gradient of the objective, meaning that

$$ \frac{2}{n} (\mathbf{X}^\top\mathbf{X}\mathbf{\hat{w}} - \mathbf{X}^\top \mathbf{y}) = \mathbf{0}_d. $$

This condition is equivalent to the positive (semi)definite linear system of equations:

$$ \mathbf{X}^\top\mathbf{X}\mathbf{\hat{w}} = \mathbf{X}^\top \mathbf{y} $$

and can be solved by

$$ \mathbf{\hat{w}} = (\mathbf{X}^\top\mathbf{X})^\dagger \mathbf{X}^\top \mathbf{y}, $$

where $(\mathbf{X}^\top\mathbf{X})^\dagger$ denotes the Moore-Penrose pseudoinverse of the matrix $\mathbf{X}^\top\mathbf{X}$. Computing a pseudoinverse can be done in time polynomial in the size of the matrix, so $\mathbf{\hat{w}}$ can be found (exactly) in polynomial time. Alternatively, $\mathbf{\hat{w}}$ can be found by solving the linear system $ \mathbf{X}^\top\mathbf{X}\mathbf{\hat{w}} = \mathbf{X}^\top \mathbf{y} $ using, e.g., Gaussian elimination, which can also be done in time polynomial in $n, d$. When $n$ and $d$ are large, however, either of these algorithms, although polynomial-time, can be quite slow. In such cases, we can tolerate approximate solutions, which we can obtain with efficient algorithms like gradient descent and stochastic gradient descent that we saw in previous lectures on optimization.
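A minimal sketch comparing these approaches on hypothetical synthetic data: the pseudoinverse formula, a direct linear solve, and plain gradient descent on the MSE objective should all agree up to numerical/optimization error.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 1000, 10
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# 1) Pseudoinverse
w_pinv = np.linalg.pinv(X.T @ X) @ (X.T @ y)

# 2) Directly solving the linear system X^T X w = X^T y
w_solve = np.linalg.solve(X.T @ X, X.T @ y)      # assumes X^T X is invertible

# 3) Gradient descent on the MSE objective (1/n)||Xw - y||_2^2
L = 2 * np.linalg.norm(X, 2) ** 2 / n            # smoothness constant of the objective
w_gd = np.zeros(d)
for _ in range(500):
    grad = (2 / n) * X.T @ (X @ w_gd - y)
    w_gd -= grad / L

print(np.allclose(w_pinv, w_solve), np.allclose(w_pinv, w_gd, atol=1e-4))
```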

8.3 Bounding the Sample Complexity

We now focus on understanding the performance of the OLS estimator $\mathbf{\hat{w}} = (\mathbf{X}^\top\mathbf{X})^\dagger \mathbf{X}^\top \mathbf{y}$ as a function of $n$.

Theorem 1 Under the fixed design model described in this section, where the label noise samples are i.i.d. drawn from $\mathcal{N}(0, \sigma^2),$ we have $$ \frac{1}{n}\mathbb{E}\big[\|\mathbf{X \hat{w} - X w_*}\|_2^2\big] \leq C_1 \sigma^2 \frac{r}{n}, $$ where $r$ is the rank of the matrix $\mathbf{X}^\top \mathbf{X}$ and $C_1$ is an absolute constant. Moreover, for any $\delta > 0,$ we have that with probability at least $1 - \delta$ it holds that $$ \frac{1}{n}\|\mathbf{X \hat{w} - X w_*}\|_2^2 \leq C_2 \sigma^2\, \frac{r + \log(1/\delta)}{n}, $$ where $C_2$ is an absolute constant.

Proof Let $\boldsymbol{\eta} = [\eta_1, \eta_2, \dots, \eta_n]^\top$ denote the vector of noise observations. Observe first that by definition,

$$ \|\mathbf{y} - \mathbf{X\hat{w}}\|_2^2 \leq \|\mathbf{y} - \mathbf{X{w}_*}\|_2^2 = \|\boldsymbol{\eta}\|_2^2. $$

On the other hand, since $\mathbf{y} = \mathbf{X{w}_*} + \boldsymbol{\eta},$ expanding the square we also have

$$ \|\mathbf{y} - \mathbf{X\hat{w}}\|_2^2 = \|\mathbf{X}\mathbf{\hat{w}} - \mathbf{X{w}_*}\|_2^2 - 2\boldsymbol{\eta}^\top\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*) + \|\boldsymbol{\eta}\|_2^2. $$

Thus, combining these two equations, we have

$$ \|\mathbf{X}\mathbf{\hat{w}} - \mathbf{X{w}_*}\|_2^2 \leq 2\boldsymbol{\eta}^\top\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*) = 2\|\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)\|_2 \frac{\boldsymbol{\eta}^\top\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)}{\|\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)\|_2}. $$

Our next step is to bound the quantity $\frac{\boldsymbol{\eta}^\top\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)}{\|\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)\|_2}$, but we need to be careful to not be too sloppy here and correctly account for the possibly small rank $r$ of $\mathbf{X^\top X}.$ What we know from linear algebra is that there exists an orthonormal basis $\boldsymbol{\Phi} = [\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \dots, \boldsymbol{\phi}_r]\in \mathbb{R}^{n \times r}$ of the column span of $\mathbf{X}$ such that $\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)$ can be expressed as $\boldsymbol{\Phi}\mathbf{z}$ for some vector $\mathbf{z} \in \mathbb{R}^r.$ As a consequence, we can write

$$ \frac{\boldsymbol{\eta}^\top\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)}{\|\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)\|_2} = \frac{\boldsymbol{\eta}^\top\mathbf{\Phi z}}{\|\mathbf{\Phi z}\|_2} = \frac{\boldsymbol{\eta}^\top\mathbf{\Phi z}}{\|\mathbf{z}\|_2} = \frac{\boldsymbol{\eta}_r^\top\mathbf{z}}{\|\mathbf{z}\|_2}, $$

where $\boldsymbol{\eta}_r := \mathbf{\Phi}^\top\boldsymbol{\eta} = [\boldsymbol{\phi}_1^\top \boldsymbol{\eta}, \boldsymbol{\phi}_2^\top \boldsymbol{\eta}, \dots, \boldsymbol{\phi}_r^\top \boldsymbol{\eta}]^\top.$ Because $\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \dots, \boldsymbol{\phi}_r$ are orthonormal and $\boldsymbol{\eta} \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 \mathbf{I}_n),$ we have that $\boldsymbol{\eta}_r \sim \mathcal{N}(\mathbf{0}_r, \sigma^2 \mathbf{I}_r).$

Now, observe that $\frac{\mathbf{z}}{\|\mathbf{z}\|_2}$ is a unit vector of dimension $r$. This vector is not necessarily independent of $\boldsymbol{\eta}_r,$ as the estimator $\mathbf{\hat{w}}$ depends on the labels and thus on the noise samples $\boldsymbol{\eta}.$ However, using Cauchy-Schwarz inequality, we have

$$ \frac{\boldsymbol{\eta}^\top\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)}{\|\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)\|_2} = \frac{\boldsymbol{\eta}^\top\mathbf{\Phi z}}{\|\mathbf{\Phi z}\|_2} \leq \|\boldsymbol{\eta}_r\|_2. $$

Recalling that $\boldsymbol{\eta}_r \sim \mathcal{N}(\mathbf{0}_r, \sigma^2 \mathbf{I}_r)$ (as argued above), we know that

$$ \mathbb{E}[\|\boldsymbol{\eta}_r\|_2^2] = \sigma^2 r. $$

On the other hand, rearranging the inequality

$$ \|\mathbf{X}\mathbf{\hat{w}} - \mathbf{X{w}_*}\|_2^2 \leq 2\|\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)\|_2 \frac{\boldsymbol{\eta}^\top\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)}{\|\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)\|_2} $$

that we derived above (dividing both sides by $\|\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)\|_2$, applying the bound $\frac{\boldsymbol{\eta}^\top\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)}{\|\mathbf{X}(\mathbf{\hat{w}} - \mathbf{w}_*)\|_2} \leq \|\boldsymbol{\eta}_r\|_2$, and squaring), we have

$$ \|\mathbf{X}\mathbf{\hat{w}} - \mathbf{X{w}_*}\|_2^2 \leq 4\|\boldsymbol{\eta}_r\|_2^2, $$

and, taking expectations on both sides and dividing by $n$, we get the claimed bound

$$ \mathbb{E}[(1/n)\|\mathbf{X}\mathbf{\hat{w}} - \mathbf{X{w}_*}\|_2^2] \leq 4\frac{\sigma^2r}{n}. $$

To obtain the high probability bound, we look at the moment generating function of $V := \frac{1}{n}\|\mathbf{X}\mathbf{\hat{w}} - \mathbf{X{w}_*}\|_2^2 - 4\frac{\sigma^2r}{n} \leq \frac{4}{n}\|\boldsymbol{\eta}_r\|_2^2 - 4\frac{\sigma^2r}{n}$. Recall that the entries of $\boldsymbol{\eta}_r$ are i.i.d. Gaussian with mean zero and variance $\sigma^2.$ Thus, using independence and grouping like terms, we can write, for any $\lambda > 0,$

\begin{align*} \mathbb{E}[e^{\lambda V}] &\leq \Big(\mathbb{E}[e^{\frac{4 \sigma^2\lambda}{n}(\eta_{r, 1}^2/\sigma^2 - 1)}]\Big)^r. \end{align*}

Now, observe that $\eta_{r, 1}/\sigma$ is standard Gaussian. If you go back to the proof of Theorem 1 in Lecture Note 3, you will see that we showed there that the moment generating function of $z^2 - 1$ for $z \sim \mathcal{N}(0, 1)$ satisfies $\mathbb{E}[e^{\bar{\lambda}(z^2 - 1)}] \leq e^{2\bar{\lambda}^2}$ for $|\bar{\lambda}| \leq 1/4.$ Thus, setting $\bar{\lambda} = \frac{4 \sigma^2\lambda}{n}$, we have that for $\frac{4 \sigma^2|\lambda|}{n} \leq 1/4,$ it holds that

$$ \mathbb{E}[e^{\lambda V}] \leq e^{2r \bar{\lambda}^2} = e^{\frac{32 r \sigma^4\lambda^2}{n^2}}. $$

To complete the proof, it remains to apply the Chernoff bound to $V$ (or, equivalently, Markov's inequality to $e^{\lambda V}$), similarly to what was done in the proof of Theorem 1 in Lecture Note 3. This is left as an exercise. $\blacksquare$
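To complement the proof, here is a small Monte Carlo sketch (with hypothetical dimensions and noise level) checking that the in-expectation error indeed has the claimed $\sigma^2 r / n$ scaling under the fixed-design model.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, sigma = 500, 20, 0.5                 # hypothetical fixed-design sizes and noise level
X = rng.standard_normal((n, d))            # fixed design matrix (kept the same across trials)
w_star = rng.standard_normal(d)
r = np.linalg.matrix_rank(X.T @ X)
ols_map = np.linalg.pinv(X.T @ X) @ X.T    # maps labels y to the OLS solution w_hat

errs = []
for _ in range(2000):                      # fresh noise in every trial
    y = X @ w_star + sigma * rng.standard_normal(n)
    w_hat = ols_map @ y
    errs.append(np.sum((X @ (w_hat - w_star)) ** 2) / n)

print(f"empirical mean of (1/n)||X(w_hat - w_*)||^2 : {np.mean(errs):.5f}")
print(f"sigma^2 r / n                               : {sigma**2 * r / n:.5f}")
```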

8.4 Generalizations

We now discuss some generalizations of the linear regression model discussed above.

8.4.1 Linear Regression in the Agnostic Model

In the agnostic learning model, we are interested in picking a hypothesis function from the set $\mathcal{H}$ that is the "best fit" $-$ meaning that it minimizes the error of the model, in this case measured by the square loss. Compared to the standard PAC learning we introduced in past lectures, we no longer assume that there is a function that perfectly fits the data. Instead, we measure how well a chosen model does by comparing it to the best possible hypothesis function from the class $\mathcal{H}$.

A Different Path to OLS

OLS is a meaningful problem to solve even in the agnostic (arbitrary) model of noise $\eta.$ Notice that in the above section we reached this problem without a priori choosing our loss function, but by looking for the maximum likelihood estimator of the weight vector for the linear model we were considering. A different path to reaching the OLS problem would have been to consider minimizing the (population) squared loss

$$\mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y - \mathbf{w}^\top \mathbf{x})^2],$$

in which case OLS arises as the empirical version of this problem.

Observe that we can equivalently write the squared loss as

\begin{align*} \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y - \mathbf{w}^\top \mathbf{x})^2] &= \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y -\mathbf{w}_*^\top \mathbf{x} + \mathbf{w}_*^\top \mathbf{x} - \mathbf{w}^\top \mathbf{x})^2]\\ &= \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y -\mathbf{w}_*^\top \mathbf{x})^2] + \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(\mathbf{w}_*^\top \mathbf{x} - \mathbf{w}^\top \mathbf{x})^2 + 2 (y -\mathbf{w}_*^\top \mathbf{x})(\mathbf{w}_*^\top \mathbf{x} - \mathbf{w}^\top \mathbf{x})]. \end{align*}

The first term on the right-hand side, $\mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y -\mathbf{w}_*^\top \mathbf{x})^2],$ is called the "irreducible error" -- no algorithm can avoid it. The second part, $\mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(\mathbf{w}_*^\top \mathbf{x} - \mathbf{w}^\top \mathbf{x})^2 + 2 (y -\mathbf{w}_*^\top \mathbf{x})(\mathbf{w}_*^\top\mathbf{x} - \mathbf{w}^\top \mathbf{x})],$ is called the "reducible error." We can make it arbitrarily small by making $\mathbf{w}$ arbitrarily close to $\mathbf{w}_*$; we do that by drawing sufficiently many samples and running an algorithm to construct a sufficiently good estimate $\mathbf{\hat{w}}$ of the vector $\mathbf{w}_*.$

It turns out that even under fairly general (in fact, almost arbitrary) assumptions about the distribution of $(\mathbf{x}, y)$ (and, consequently, $\eta$), OLS on the empirical data is the "right" algorithm for minimizing the (population) squared loss. It is also possible to bound the sample complexity, which we discuss below, after setting up our problem so that we can simplify some of the notation.

Setup

Error/Loss. Observe that our hypothesis class contains all linear functions of the form $h_\mathbf{w}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x},$ while the error is measured by the square loss:

$$ \mathrm{err}(h_\mathbf{w}) = \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y - \mathbf{w}^\top \mathbf{x})^2]. $$

Since, as usual, we will be measuring the error based on $n$ i.i.d. samples $(\mathbf{x}_1, y_1), \dots, (\mathbf{x}_n, y_n),$ the empirical error will simply be the least squares loss (or the mean squared error):

$$ \widehat{\mathrm{err}}(h_\mathbf{w}) = \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2. $$

Let $\mathbf{w}_*$ be the "target" weight vector, that is, the vector that minimizes the squared error, $\mathbf{w}_* \in \arg\min_{\mathbf{w}\in \mathbb{R}^d}\mathrm{err}(h_\mathbf{w})$, and denote

$$ \mathrm{OPT} := \min_{\mathbf{w}\in \mathbb{R}^d}\mathrm{err}(h_\mathbf{w}) = \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y - \mathbf{w}_*^\top \mathbf{x})^2]. $$

Rescaling/Normalization. We now make some simplifying assumptions, which are without loss of generality due to simple rescaling whenever the optimal error $\mathrm{OPT}$ is finite (otherwise we could output an arbitrary hypothesis). In particular, we assume that for any sample $(\mathbf{x}, y)$ we can draw from $\mathcal{D}$ we have

$$ \|\mathbf{x}\|_2 \leq 1 \; \; \text{and}\;\; |y| \leq 1. $$

We also further assume that the target vector is normalized so that

$$ \|\mathbf{w}_*\|_2 \leq 1. $$

Note that this also allows us to restrict our hypothesis class to only contain linear functions determined by vectors $\mathbf{w} \in \mathbb{R}^d$ with $\|\mathbf{w}\|_2 \leq 1$.

Observe that after this normalization, we must have $\mathrm{OPT} \leq 4,$ since $|y| \leq 1$ and $|\mathbf{w}_*^\top \mathbf{x}| \leq \|\mathbf{w}_*\|_2 \|\mathbf{x}\|_2 \leq 1$ imply $(y - \mathbf{w}_*^\top \mathbf{x})^2 \leq 4$.

Bounding the Sample Complexity

Given parameters $\epsilon, \delta \in (0, 1)$, our goal is to output some $\hat{\mathbf{w}}$ such that with probability at least $1- \delta,$ we have

$$\mathrm{err}(h_{\hat{\mathbf{w}}}) \leq \mathrm{OPT} + \epsilon.$$

Observe that this is the best error we can hope for, since the minimum achievable error is $\mathrm{OPT}.$

Quantization. Recall that when we were bounding the sample complexity of linear classification, we made use of the hypothesis class being finite, combining this with the Hoeffding bound and a union bound. We would like to do something similar here, but unfortunately the class of all linear functions, even under the rescaling/normalization we adopted above, still contains infinitely many possible hypothesis functions (as there are infinitely many vectors of norm at most one in $\mathbb{R}^d$). On the other hand, we can only hope to solve our problem to some finite error, so maybe we do not need infinite precision to distinguish between all possible vectors $\mathbf{w} \in \mathbb{R}^d$ with $\|\mathbf{w}\|_2 \leq 1$ $-$ and, anyway, our computers only have finite precision $-$ so we may as well quantize this class. In particular, instead of considering all such vectors $\mathbf{w}$, let us consider only those vectors whose entries (in $[-1, 1]$ by rescaling) take values from the discrete $\epsilon'$-grid for some $\epsilon' > 0$ we'll determine later. More precisely, we will restrict ourselves to only consider vectors $\mathbf{w} = [w_1, w_2, \dots, w_d]^\top$ such that each $w_j \in \{0, \pm \epsilon', \pm 2\epsilon', \pm 3 \epsilon', \dots, \pm 1\},$ $j \in \{1, \dots, d\}.$ Observe that there are then roughly $\frac{2}{\epsilon'}$ possible values for each coordinate of $\mathbf{w}$ and thus no more than $\big(\frac{2}{\epsilon'}\big)^d$ distinct quantized vectors $\mathbf{w}$ (we ignore the additive $+1$ per coordinate, as it only affects constants). We denote the class of linear functions determined by such quantized vectors by $\mathcal{H}_{\epsilon'}.$
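To make the quantization step concrete, here is a minimal sketch; the helper `quantize` and the values of $d$ and $\epsilon'$ below are illustrative choices, not part of the formal argument.

```python
import numpy as np

def quantize(w, eps_prime):
    """Round each entry of w (assumed to lie in [-1, 1]) to the nearest point
    of the grid {0, +/-eps', +/-2 eps', ..., +/-1}."""
    return np.clip(np.round(w / eps_prime) * eps_prime, -1.0, 1.0)

eps_prime, d = 0.1, 3                       # hypothetical grid resolution and dimension
w = np.array([0.33, -0.74, 0.08])
print(quantize(w, eps_prime))               # each entry moves by at most eps'/2
print((2 / eps_prime + 1) ** d)             # number of grid vectors, roughly (2/eps')^d
```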

Bounding the Sampling Error. Suppose that we pick our hypothesis function as the solution to a constrained least squares problem:

$$ \mathbf{\hat{w}} \in \arg\min_{\mathbf{w}\in \mathbb{R}^d: \|\mathbf{w}\|_2 \leq 1} \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{w}^\top \mathbf{x}_i)^2. $$

Here, for simplicity, we will ignore the error from potentially approximately solving the above optimization problem. However, if we wanted, we could reduce our total error budget to $\epsilon/2$ and solve the above problem to error $\epsilon/2,$ which can be done with standard optimization algorithms (many of which are available as a black box in different Python packages).
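One simple option among such standard algorithms is projected gradient descent, which projects back onto the unit Euclidean ball after every gradient step. The sketch below is one illustrative way to do this, with hypothetical step-size and iteration-count choices.

```python
import numpy as np

def constrained_ols(X, y, steps=1000, lr=None):
    """Projected gradient descent for min_{||w||_2 <= 1} (1/n)||Xw - y||_2^2.
    A sketch; lr defaults to 1/L, where L = 2 sigma_max(X)^2 / n is the
    smoothness constant of the objective."""
    n, d = X.shape
    if lr is None:
        lr = n / (2 * np.linalg.norm(X, 2) ** 2)
    w = np.zeros(d)
    for _ in range(steps):
        grad = (2 / n) * X.T @ (X @ w - y)   # gradient of the MSE objective
        w = w - lr * grad
        norm = np.linalg.norm(w)
        if norm > 1.0:                        # project onto the unit ball
            w = w / norm
    return w
```

Projection onto the unit Euclidean ball is just a rescaling, which is what makes this simple approach convenient here.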

Observe that by our scaling, for each sample $(\mathbf{x}_i, y_i)$ and any $\mathbf{w} \in \mathbb{R}^d$ with $\|\mathbf{w}\|_2 \leq 1,$ we have that $(\mathbf{w}^\top\mathbf{x}_i - y_i)^2 \in [0, 4]$. Thus, by the Hoeffding concentration bound combined with a union bound over $\mathcal{H}_{\epsilon'},$ we have that for any $t > 0,$

$$ \mathbb{P}[\exists h_\mathbf{w} \in \mathcal{H}_{\epsilon'}: |\widehat{\mathrm{err}}(h_{\mathbf{w}}) - \mathrm{err}(h_{\mathbf{w}})| \geq t] \leq 2|\mathcal{H}_{\epsilon'}|e^{-\frac{2nt^2}{16}} \leq 2\Big(\frac{2}{\epsilon'}\Big)^d e^{-\frac{nt^2}{8}}. $$

In particular, to ensure that

$$ \mathbb{P}[\exists h_\mathbf{w} \in \mathcal{H}_{\epsilon'}: |\widehat{\mathrm{err}}(h_{\mathbf{w}}) - \mathrm{err}(h_{\mathbf{w}})| \geq \epsilon/4] \leq \delta/2, $$

it suffices to use $n = \big\lceil\frac{128}{\epsilon^2}\big(d\ln(2/\epsilon') + \ln(4/\delta)\big)\big\rceil = O\big(\frac{d\ln(1/\epsilon') + \ln(1/\delta)}{\epsilon^2}\big)$ samples.

A Bound on the Total Error. Let $\mathbf{\hat{w}}_*$ be the "quantized" version of $\mathbf{w}_*,$ obtained by rounding each of its entries to the closest point on the $\epsilon'$-grid of $[-1, 1]$ we discussed above. Because $\mathbf{\hat{w}}$ minimizes the empirical error, we have that

$$ \widehat{\mathrm{err}}(h_{\mathbf{\hat{w}}}) \leq \widehat{\mathrm{err}}(h_{\mathbf{\hat{w}}_{*}}). $$

We will derive the rest of the bound assuming that $\mathbf{\hat{w}}$ is quantized, meaning that $h_{\mathbf{\hat{w}}} \in \mathcal{H}_{\epsilon'}$. This of course does not hold in general, and we would need to consider a quantized version of $\mathbf{\hat{w}}$ and then bound the error between the two hypotheses. You are encouraged to do this on your own. The rest of the derivation we provide should be sufficient for figuring out how to do that.

Based on the sampling error bound we proved above, we have that with probability at least $1-\delta$,

$$ \mathrm{err}(h_{\mathbf{\hat{w}}}) \leq \widehat{\mathrm{err}}(h_{\mathbf{\hat{w}}}) + \epsilon/4 \leq \widehat{\mathrm{err}}(h_{\mathbf{\hat{w}}_*}) + \epsilon/4 \leq {\mathrm{err}}(h_{\mathbf{\hat{w}}_*}) + \epsilon/2, $$

and, thus, to get what we want, what remains to be done is show that ${\mathrm{err}}(h_{\mathbf{\hat{w}}_*}) \leq \mathrm{OPT} + \epsilon/2.$

Bounding the Error of the Quantized Solution. To bound $\mathrm{err}(h_{\mathbf{\hat{w}}_*}),$ we argue that all we need to do is bound the quantization error $\mathrm{err}_q := \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(\mathbf{w}_*^\top \mathbf{x} - \mathbf{\hat{w}}_*^\top \mathbf{x})^2]$ by appropriately choosing $\epsilon' > 0.$ To see this, observe that

\begin{align*} \mathrm{err}(h_{\mathbf{\hat{w}}_*}) = \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y - \mathbf{\hat{w}}_*^\top \mathbf{x})^2] &= \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y -\mathbf{w}_*^\top \mathbf{x} + \mathbf{w}_*^\top \mathbf{x} - \mathbf{\hat{w}}_*^\top \mathbf{x})^2]\\ &= \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y -\mathbf{w}_*^\top \mathbf{x})^2] + \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(\mathbf{w}_*^\top\mathbf{x} - \mathbf{\hat{w}}_*^\top \mathbf{x})^2] + 2 \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y -\mathbf{w}_*^\top\mathbf{x})(\mathbf{w}_*^\top \mathbf{x} - \mathbf{\hat{w}}_*^\top \mathbf{x})]. \end{align*}

Recall that $\mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y -\mathbf{w}_*^\top \mathbf{x})^2] = \mathrm{OPT},$ by definition and normalization/rescaling. For the product term, we use the following inequality, which follows from the Cauchy-Schwarz inequality: for any two random variables $X, Y,$ we have $\mathbb{E}[XY] \leq \sqrt{\mathbb{E}[X^2]\mathbb{E}[Y^2]}.$ Plugging back into the above equation and simplifying, we get:

\begin{align*} \mathrm{err}(h_{\mathbf{\hat{w}}_*}) \leq \;& \mathrm{OPT} + \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(\mathbf{w}_*^\top \mathbf{x} - \mathbf{\hat{w}}_*^\top \mathbf{x})^2]\\ &+ 2 \sqrt{\mathrm{OPT}\,\mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(\mathbf{w}_*^\top \mathbf{x} - \mathbf{\hat{w}}_*^\top \mathbf{x})^2]}\\ =\;& \mathrm{OPT} + \mathrm{err}_q + 2 \sqrt{\mathrm{OPT}\,\mathrm{err}_q}\\ \leq\; & \mathrm{OPT} + \mathrm{err}_q + 4 \sqrt{\mathrm{err}_q}, \end{align*}

where in the last line we used that $\mathrm{OPT} \leq 4,$ which holds by our scaling/normalization.

Thus, to complete our bound, it remains to argue that we can have $\mathrm{err}_q + 4 \sqrt{\mathrm{err}_q} \leq \epsilon/2$. You can check that $\mathrm{err}_q \leq \frac{\epsilon^2}{64}$ would suffice, so we now focus on choosing $\epsilon'$ so that $\mathrm{err}_q \leq \frac{\epsilon^2}{64}$.

Bounding the Quantization Error. Recalling that $\|\mathbf{x}\|_2 \leq 1$ for any data vector and that all entries of $\mathbf{w}_*$ and $\mathbf{\hat{w}}_*$ must be within $\epsilon'/2$ of each other, we have that by the Cauchy-Schwarz inequality

\begin{align*} \mathrm{err}_q =\;& \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(\mathbf{w}_*^\top \mathbf{x} - \mathbf{\hat{w}}_*^\top \mathbf{x})^2]\\ \leq\;& \mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[\|\mathbf{w}_* - \mathbf{\hat{w}}_*\|_2^2\|\mathbf{x}\|_2^2]\\ \leq\;& \|\mathbf{w}_* - \mathbf{\hat{w}}_*\|_2^2 \\ \leq\;& \frac{(\epsilon')^2}{4}\|\mathbf{1}_d\|_2^2\\ =\; & \frac{(\epsilon')^2 d}{4}. \end{align*}

Thus, it suffices to choose $\epsilon' \propto \epsilon /\sqrt{d}$ (to be precise, since we need $\mathrm{err}_q \leq \epsilon^2/64$, we could choose $\epsilon' = \frac{\epsilon}{4\sqrt{ d}}$).

Final sample complexity bound. To complete our sample complexity bound, it remains to plug the choice of $\epsilon'$ back into the expression for $n.$ Doing so, we conclude that $n = O\big(\frac{d\ln(d/\epsilon) + \ln(1/\delta)}{\epsilon^2}\big)$ samples suffice.

8.4.2 Linear Regression and Nonlinear Models

When we talk about linear regression, we do not necessarily need to consider linear models of the data. In particular, we can "replace" our data vectors by any nonlinear mapping of the data. The only part of the model that needs to remain linear is the dependence on the weight vector $\mathbf{w}.$ When the nonlinear mapping used is a polynomial function of the entries of the data vectors, the resulting linear regression problem is also called "polynomial" regression. As a specific simple but informative example, consider the case where we have 3-dimensional data vectors $\mathbf{x} = [x_1, x_2, x_3]^\top.$ Then we can, for example, consider fitting a model given by a polynomial function like

$$ \hat{y} = w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_1 x_3 + w_5 x_2 x_3^2 + w_6 x_3^3. $$

This now allows us to construct more complex nonlinear models of the data that can fit our data with much lower error. The downside is, of course, that the higher the degree of the polynomials we consider, the larger the complexity of the model (which also influences how expensive it is to train and eventually use the constructed model). Additionally, there is a potential for "overfitting" to the data, which would eventually increase the prediction error on unseen data.

For a more visual understanding of what can happen, consider the following illustration using 1D data.
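A minimal sketch of such an illustration, with hypothetical 1D data: a low-degree fit underfits the trend, while a very high degree starts chasing the noise.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
n = 30
x = np.sort(rng.uniform(-1, 1, n))
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)   # hypothetical 1D data

xs = np.linspace(-1, 1, 300)
plt.scatter(x, y, color="k", s=15, label="data")
for deg in (1, 3, 10):
    coeffs = np.polyfit(x, y, deg)            # least-squares polynomial fit of given degree
    plt.plot(xs, np.polyval(coeffs, xs), label=f"degree {deg}")
plt.ylim(-2, 2)
plt.legend()
plt.show()
```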

8.4.3 Generalized Linear Models

Another way of modeling nonlinearities in the data is by considering generalized linear models, which compose the linear mapping with a nonlinear one. In particular, such models assign labels as

$$ \hat{y} = \sigma(\mathbf{\hat{w}}^\top \mathbf{x}), $$

where $\mathbf{\hat{w}}$ is the model weight vector defining the linear mapping (as before) and $\sigma: \mathbb{R}\to \mathbb{R}$ is a univariate function called the activation (or link) function. For example, we can choose $\sigma$ as the rectified linear unit (ReLU) function $\sigma(t) = \max\{0, t\}$ or the sigmoid function $\sigma(t) = \frac{e^t}{1 + e^t}.$

The goal could still be to minimize the square loss, which is now for a vector $\mathbf{w}$ defined by

$$\mathbb{E}_{(\mathbf{x}, y)\sim \mathcal{D}}[(y - \sigma(\mathbf{w}^\top \mathbf{x}))^2].$$
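As an illustration (a sketch, not a general-purpose training routine), one could minimize the empirical version of this square loss for a sigmoid link by plain gradient descent, differentiating through $\sigma$ with the chain rule; the data, step size, and iteration count below are hypothetical.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_sigmoid_glm(X, y, steps=3000, lr=0.5):
    """Gradient descent on (1/n) sum_i (y_i - sigmoid(w^T x_i))^2.
    A sketch for the sigmoid link; other links change sigmoid and its derivative."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        p = sigmoid(X @ w)
        # chain rule: sigmoid'(t) = sigmoid(t) * (1 - sigmoid(t))
        grad = (-2.0 / n) * X.T @ ((y - p) * p * (1.0 - p))
        w -= lr * grad
    return w

# Hypothetical usage on synthetic, nearly realizable data
rng = np.random.default_rng(6)
X = rng.standard_normal((2000, 5))
w_true = rng.standard_normal(5)
y = sigmoid(X @ w_true) + 0.05 * rng.standard_normal(2000)
w_hat = fit_sigmoid_glm(X, y)
print(np.round(w_hat, 2), np.round(w_true, 2))   # the two should roughly match
```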

How to train generalized linear models in the realizable setting $-$ where we assume the existence of a model that perfectly fits the data $-$ is by now well-understood. The agnostic case of this problem, on the other hand, is still a topic of active research. It is known that for basic nonlinear activation functions like the ReLU, obtaining guarantees of the form $\mathrm{OPT} + \epsilon$ is computationally intractable even for $\mathbf{x}$ drawn from the standard Gaussian distribution. What is possible are guarantees of the form $C\,\mathrm{OPT} + \epsilon$, where $C$ is an absolute constant (independent of the dimension and the scaling of $\mathbf{x}, y, \mathbf{w},$ but possibly dependent on the parameters of the distribution of $\mathbf{x}$). Even such guarantees are computationally tractable only for structured classes of distributions. What the largest such classes of distributions are, and for which classes of "interesting" (i.e., commonly used in practice) activation functions such guarantees hold, is still actively investigated in current research.

Bibliographical Notes

The material on fixed design is, in part, based on Chapter 2 in lecture notes for 18.S997: High Dimensional Statistics at MIT, as taught by Philippe Rigollet in 2015. Another reference used in preparing this lecture is the book by Shalev-Shwartz and Ben-David ("Understanding Machine Learning: From Theory to Algorithms"); mainly parts of Chapters 4 and 9.